Importing and Exporting Data

Data can be imported into Google BigQuery from a CSV file stored within Google Cloud Storage, or it can be streamed directly into BigQuery from Python code.

Similarly, table contents or query results can be exported to Google Cloud Storage as a set of sharded files, or written directly to a local file within Datalab. For larger data sizes, the sharded export to Cloud Storage is recommended.
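
In API terms, these paths come down to a few calls on the Table class in the google.datalab.bigquery module used throughout this notebook. The sketch below is illustrative only: the bucket name, file paths, and dataframe variable are placeholders, and the wildcard in the extract destination is what requests a sharded export.

import google.datalab.bigquery as bq

table = bq.Table('importingsample.cars')

# Import: load a CSV object from Cloud Storage, or stream rows from a DataFrame.
table.load('gs://my-bucket/cars.csv', mode='append', source_format='csv',
           csv_options=bq.CSVOptions(skip_leading_rows=1))
table.insert(dataframe)  # dataframe: any pandas DataFrame matching the table schema

# Export: shard to Cloud Storage (wildcard URI), or download to a local file.
table.extract(destination='gs://my-bucket/export/cars-*.csv')
table.to_file('/tmp/cars.csv')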


In [1]:
from google.datalab import Context
import google.datalab.bigquery as bq
import google.datalab.storage as storage
import pandas as pd
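# Cloud Storage reads return bytes under Python 3, so BytesIO stands in for
# Python 2's StringIO when parsing the CSV contents with pandas below.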
try:
  from StringIO import StringIO
except ImportError:
  from io import BytesIO as StringIO

Importing Data

The first step in analyzing and querying your data is importing it. For this demo, we'll create a table in a sample BigQuery dataset from a small data file in Cloud Storage, and delete both again in the Cleanup step at the end.

Importing Data from Cloud Storage

To interact with Google Cloud Storage, Datalab includes the %gcs command (usable as a line or cell magic). First, see the available subcommands with -h:


In [2]:
%gcs -h


usage: gcs [-h] {copy,create,delete,list,read,view,write} ...

Execute various Google Cloud Storage related operations. Use "%gcs <command>
-h" for help on a specific command.

positional arguments:
  {copy,create,delete,list,read,view,write}
                        commands
    copy                Copy one or more Google Cloud Storage objects to a
                        different location.
    create              Create one or more Google Cloud Storage buckets.
    delete              Delete one or more Google Cloud Storage buckets or
                        objects.
    list                List buckets in a project, or contents of a bucket.
    read                Read the contents of a Google Cloud Storage object
                        into a Python variable.
    view                View the contents of a Google Cloud Storage object.
    write               Write the value of a Python variable to a Google Cloud
                        Storage object.

optional arguments:
  -h, --help            show this help message and exit

Let's use the read option to read a storage object into a local Python variable:


In [3]:
%%gcs read --object gs://cloud-datalab-samples/cars.csv --variable cars

In [4]:
print(cars)


Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture Extended Edition","",4900.00
1999,Chevy,"Venture Extended Edition",Very Large,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00


In [6]:
# Infer the table schema, conveniently by reading the CSV into a pandas DataFrame.
df = pd.read_csv(StringIO(cars))
schema = bq.Schema.from_data(df)

# Create the dataset
bq.Dataset('importingsample').create()

# Create the table
sample_table = bq.Table('importingsample.cars').create(schema = schema, overwrite = True)

In [27]:
sample_table.load('gs://cloud-datalab-samples/cars.csv', mode='append',
                  source_format = 'csv', csv_options=bq.CSVOptions(skip_leading_rows = 1))


Job mysampleproject/job_RInz5SUINexdU5RUrA2CBgDBsUo completed

In [8]:
%%bq query -n importingSample
SELECT * FROM importingsample.cars

In [9]:
%bq execute -q importingSample


Out[9]:
Year  Make   Model                     Description                        Price
1996  Jeep   Grand Cherokee            MUST SELL! air, moon roof, loaded  4799.0
1997  Ford   E350                      ac, abs, moon                      3000.0
1999  Chevy  Venture Extended Edition  Very Large                         5000.0
1999  Chevy  Venture Extended Edition                                     4900.0

(rows: 4, time: 1.7s, 228B processed, job: job_v9Pm_m1RXKg5vkpxPmKI0VAnX2s)
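
The same query can also be issued from Python code instead of the %%bq cell magic. A minimal sketch, assuming the table created above still exists:

query = bq.Query('SELECT * FROM importingsample.cars')
df_cars = query.execute().result().to_dataframe()  # run the query and fetch results as a DataFrame
df_cars.head()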

Importing Data from a DataFrame


In [10]:
cars2 = storage.Object('cloud-datalab-samples', 'cars2.csv').read_stream()
df2 = pd.read_csv(StringIO(cars2))
df2


Out[10]:
Year Make Model Description Price
0 2010 Honda Civic NaN 15000.0
1 2015 Tesla Model S NaN 64900.0

In [11]:
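# Description was inferred as a STRING column, so replace the NaN values with
# empty strings before streaming these rows into the table.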
df2.fillna(value='', inplace=True)
df2


Out[11]:
Year Make Model Description Price
0 2010 Honda Civic 15000.0
1 2015 Tesla Model S 64900.0

In [12]:
sample_table.insert(df2)
sample_table.to_dataframe()


Out[12]:
Year Make Model Description Price
0 1997 Ford E350 ac, abs, moon 3000.0
1 1999 Chevy Venture Extended Edition 4900.0
2 1999 Chevy Venture Extended Edition Very Large 5000.0
3 1996 Jeep Grand Cherokee MUST SELL! air, moon roof, loaded 4799.0
4 2010 Honda Civic 15000.0
5 2015 Tesla Model S 64900.0

Exporting Data

Exporting Data to Cloud Storage


In [25]:
project = Context.default().project_id
sample_bucket_name = project + '-datalab-samples'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/tmp/cars.csv'
print('Bucket: ' + sample_bucket_name)
print('Object: ' + sample_bucket_object)


Bucket: mysampleproject-datalab-samples
Object: gs://mysampleproject-datalab-samples/tmp/cars.csv
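
Cloud Storage bucket names must be globally unique, so the project id is used as a prefix to make a name collision unlikely.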

In [14]:
sample_bucket = storage.Bucket(sample_bucket_name)
sample_bucket.create()
sample_bucket.exists()


Out[14]:
True

In [26]:
table = bq.Table('importingsample.cars')
table.extract(destination = sample_bucket_object)


Job mysampleproject/job_b6TqQiGuFu5d-8UiB5U5osV0cf8 completed
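
This writes a single CSV object. For larger tables, a wildcard destination such as gs://<bucket>/tmp/cars-*.csv asks BigQuery to shard the export across multiple objects, as noted at the start of this notebook.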

In [16]:
%gcs list --objects gs://$sample_bucket_name/*


Out[16]:
Name          Type                      Size  Updated
tmp/cars.csv  application/octet-stream  232   2017-03-07 09:34:30.347000+00:00

In [17]:
bucket = storage.Bucket(sample_bucket_name)

In [18]:
obj = list(bucket.objects())[0]

In [19]:
data = obj.read_stream()

In [20]:
print(data)


Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000
1999,Chevy,Venture Extended Edition,"",4900
1999,Chevy,Venture Extended Edition,Very Large,5000
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799

Exporting Data to a Local File


In [21]:
table.to_file('/tmp/cars.csv')

In [22]:
%%bash
ls -l /tmp/cars.csv


-rw-r--r-- 1 root root 299 Mar  7 09:34 /tmp/cars.csv

In [23]:
lines = None
with open('/tmp/cars.csv') as datafile:
  lines = datafile.readlines()
print(''.join(lines))


Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.0
1999,Chevy,Venture Extended Edition,,4900.0
1999,Chevy,Venture Extended Edition,Very Large,5000.0
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.0
2010,Honda,Civic,,15000.0
2015,Tesla,Model S,,64900.0

Cleanup


In [24]:
sample_bucket.object('tmp/cars.csv').delete()
sample_bucket.delete()
bq.Dataset('importingsample').delete(delete_contents = True)